Data Entry Errors

We actually found that the article points out five data entry errors. On page 3, the author notes that case 42 weighs 205 pounds but measures only 29.5 inches in height. This is inaccurate, and by using the adiposity index, they were able to conclude their height should have been 69.5 inches, a value that is only off by 1 digit.

Furthermore, the author created a table at the top of page 4 that lists 3 cases (cases 48, 76, and 96) where the listed bone density values do not give rise to the two estimates of body fat percentage recorded. The author does give the apparently correct bone density values that coordinate with the body fat percentage values. Again, each of these entries were only off by 1 digit in each case.

On the same page, the author mentions that case 182 is a particularly lean individual whose predicted percentage of body fat is negative according to Siri’s and Brozek’s equation and had been truncated to zero in the dataset, which is clearly an anamoly.

In addition to the errors that the author has mentioned, we found a number of cases that had apparent data entry errors as well.

data <- read.table("http://ww2.amstat.org/publications/jse/datasets/fat.dat.txt", col.names = c("Case_number", "body_fat_Brozek", "body_fat_siri", "density", "age", "weight", "height", "adiposity_index", "fat_free_wgt", "neck", "chest", "abs", "hip","thigh", "knee", "ankle", "bicep", "forearm", "wrist"))
head(data)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 6           6            20.6          20.9  1.0502  24 210.25  74.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 6            26.5        167.0 39.0 104.5  94.4 107.8  66.0 42.0  25.6
##   bicep forearm wrist
## 1  32.0    27.4  17.1
## 2  30.5    28.9  18.2
## 3  28.8    25.2  16.6
## 4  32.4    29.4  18.2
## 5  32.2    27.7  17.7
## 6  35.7    30.6  18.8
library(ggplot2)
p <- ggplot(data = data, aes(data$Case_number, data$height, color = Case_number)) +
  geom_point()+
  theme_bw()
library(plotly)
ggplotly(p)

Using the above plotly code, we plotted each of the variables against their case number to get an idea where any extreme outliers were in the midst of the data set. For example, it is easy to see that case 42 exemplifies a clear data entry error when plotting height against the case numbers because it shows that the person was only 29.5 inches in height.

library(ggplot2)
p <- ggplot(data = data, aes(data$Case_number, data$weight, color = Case_number)) +
  geom_point()+
  theme_bw()
library(plotly)
ggplotly(p)

However, it is necessary to be careful. When plotting weight against case numbers (shown above), it is clear that case 39 is an outlier, weighing 363.15 pounds, but it does not appear that this is a data entry error because the adiposity index for this person is 48.9, which is reasonable for a person as overweight as the person in case 39.

library(ggplot2)
p <- ggplot(data = data, aes(data$Case_number, data$ankle, color = Case_number)) +
  geom_point()+
  theme_bw()
library(plotly)
ggplotly(p)

It also appears that cases 31 and 86 have data entry errors in the ankle variable. These people have ankle measurements that far exceed the heaviest person in the data set (case 39), and they are extreme outliers compared to the rest of the data. It is also quite probable that case 74’s ankle measurement was entered improperly because it is far below the rest of the measurements.

Clean Data

We eliminated all of the above entries above that had errors to create a new “clean” data set.

#Remove the following rows from the original data frame
cleandata <- data[-c(42,163,221,182,48,96,76,33,98,49,6,152,237,235,169,148,31,86), ]
head(cleandata)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 7           7            19.0          19.2  1.0549  26 181.00  69.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 7            26.2        146.6 36.4 105.1  90.7 100.3  58.4 38.3  22.9
##   bicep forearm wrist
## 1  32.0    27.4  17.1
## 2  30.5    28.9  18.2
## 3  28.8    25.2  16.6
## 4  32.4    29.4  18.2
## 5  32.2    27.7  17.7
## 7  31.9    27.8  17.7

Rounding Errors

In order to find the rounding errors we decided to check the various computed values against the actual computations. In this process, it would be necessary to compare the entered values with the computed values on a plotly graph, determine which values match up exactly (with neither rounding error, nor data entry error), find which values match up very closely within a specific tolerance (indicating a rounding error as opposed to a data entry error), and find which values do not match up closely and fall outside a reasonable tolerance (indicating a data entry error). This would be a reasonable way to go about this process of finding rounding errors as well. We found a total of 308 rounding errors.

Adiposity Index Comparison

#Here we compute and round (to the nearest tenth) the adiposity for each case by converting the weights from pounds to kilograms and the heights from inches to meters, then we divide the converted weight by the square of the converted height to compare these computed adiposity numbers to the listed adiposity numbers.
computed_and_rounded_adiposity <- round((data$weight/2.20462)/((data$height/39.3701)^2), digits=1)

#Here we create a new data set using the mutate function from the dplyr package to include a new column with our computed adiposity values. 
newdata <- data %>%
  mutate(computed_and_rounded_adiposity = computed_and_rounded_adiposity)
  head(newdata)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 6           6            20.6          20.9  1.0502  24 210.25  74.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 6            26.5        167.0 39.0 104.5  94.4 107.8  66.0 42.0  25.6
##   bicep forearm wrist computed_and_rounded_adiposity
## 1  32.0    27.4  17.1                           23.6
## 2  30.5    28.9  18.2                           23.3
## 3  28.8    25.2  16.6                           24.7
## 4  32.4    29.4  18.2                           24.9
## 5  32.2    27.7  17.7                           25.5
## 6  35.7    30.6  18.8                           26.5
#Here we take a subset of this data frame to simplify the data of interest to only include case numbers, the given adiposity values, and our newly computed and rounded adiposity values.
adiposity_comparison <- subset(newdata, select = c(Case_number, adiposity_index, computed_and_rounded_adiposity))
head(adiposity_comparison)
##   Case_number adiposity_index computed_and_rounded_adiposity
## 1           1            23.7                           23.6
## 2           2            23.4                           23.3
## 3           3            24.7                           24.7
## 4           4            24.9                           24.9
## 5           5            25.6                           25.5
## 6           6            26.5                           26.5
#Instead of counting all of the cases where the given adiposity numbers do not match our computed and rounded adiposity numbers, we will subset this data frame to only include entries where there is indeed a difference between the given adiposity values and the computed and rounded adiposity values. 
adiposity_rounding_errors <- subset(adiposity_comparison, adiposity_index != computed_and_rounded_adiposity)
head(adiposity_rounding_errors)
##    Case_number adiposity_index computed_and_rounded_adiposity
## 1            1            23.7                           23.6
## 2            2            23.4                           23.3
## 5            5            25.6                           25.5
## 8            8            23.6                           23.5
## 9            9            24.6                           24.5
## 14          14            28.5                           28.4
#Lastly, we will count how many cases exist that have improperly rounded adiposity numbers by asking for how many rows exist in this final data frame. 
num1 <- nrow(adiposity_rounding_errors)
num1
## [1] 99

There are 99 cases where the researchers improperly rounded when computing the adiposity numbers in their data set.

Plot of Adiposity Given Values vs Adiposity Computed Values

library(ggplot2)
library(plotly)
p <- ggplot(data = newdata, aes(adiposity_index,computed_and_rounded_adiposity, color = Case_number)) +
  geom_point()+
  labs(x="adiposity", y="computed adiposity")+
  theme_bw()
ggplotly(p)

Brozek Comparison

#Here we compute and round (to the nearest tenth) the Brozek values for each case.
# % Body Fat using Brozek = 457/Density - 414.2
computed_and_rounded_brozek <- round((457/data$density)-414.2, digits=1)

#Here we create a new data set using the mutate function from the dplyr package to include a new column with our computed Brozek values. 
newdata_Brozek <- data %>%
  mutate(computed_and_rounded_brozek = computed_and_rounded_brozek)
  head(newdata_Brozek)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 6           6            20.6          20.9  1.0502  24 210.25  74.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 6            26.5        167.0 39.0 104.5  94.4 107.8  66.0 42.0  25.6
##   bicep forearm wrist computed_and_rounded_brozek
## 1  32.0    27.4  17.1                        12.6
## 2  30.5    28.9  18.2                         6.9
## 3  28.8    25.2  16.6                        24.6
## 4  32.4    29.4  18.2                        10.9
## 5  32.2    27.7  17.7                        27.8
## 6  35.7    30.6  18.8                        21.0
#Here we take a subset of this data frame to simplify the data of interest to only include case numbers, the given Brozek values, and our newly computed and rounded Brozek values.
brozek_comparison <- subset(newdata_Brozek, select = c(Case_number, body_fat_Brozek, computed_and_rounded_brozek))
head(brozek_comparison)
##   Case_number body_fat_Brozek computed_and_rounded_brozek
## 1           1            12.6                        12.6
## 2           2             6.9                         6.9
## 3           3            24.6                        24.6
## 4           4            10.9                        10.9
## 5           5            27.8                        27.8
## 6           6            20.6                        21.0
#Instead of counting all of the cases where the given Brozek numbers do not match our computed and rounded Brozek numbers, we will subset this data frame to only include entries where there is indeed a difference between the given Brozek values and the computed and rounded Brozek values. 
brozek_rounding_errors <- subset(brozek_comparison, body_fat_Brozek != computed_and_rounded_brozek)

head(brozek_rounding_errors)
##    Case_number body_fat_Brozek computed_and_rounded_brozek
## 6            6            20.6                        21.0
## 8            8            12.8                        12.7
## 11          11             7.5                         7.8
## 19          19            16.1                        16.0
## 21          21            19.0                        18.9
## 30          30             9.4                         9.3
#Lastly, we will count how many cases exist that have improperly rounded adiposity numbers by asking for how many rows exist in this final data frame. 
num2 <-nrow(brozek_rounding_errors)
num2
## [1] 49

There are 49 cases where the researchers improperly rounded when computing the adiposity numbers in their data set.

Plot of Brozek Given Values vs Brozek Computed Values

library(ggplot2)
p <- ggplot(data = newdata_Brozek, aes(body_fat_Brozek, computed_and_rounded_brozek, color = Case_number)) +
  geom_point()+
  labs(x="Brozek", y="computed Brozek")+
  theme_bw()
library(plotly)
ggplotly(p)

Siri Comparison

#Here we compute and round (to the nearest tenth) the Siri values for each case.
computed_and_rounded_siri <- round((495/data$density)-450, digits=1)

#Here we create a new data set using the mutate function from the dplyr package to include a new column with our computed Siri values. 
newdata3 <- data %>%
  mutate(computed_and_rounded_siri = computed_and_rounded_siri)
  head(newdata3)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 6           6            20.6          20.9  1.0502  24 210.25  74.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 6            26.5        167.0 39.0 104.5  94.4 107.8  66.0 42.0  25.6
##   bicep forearm wrist computed_and_rounded_siri
## 1  32.0    27.4  17.1                      12.3
## 2  30.5    28.9  18.2                       6.1
## 3  28.8    25.2  16.6                      25.3
## 4  32.4    29.4  18.2                      10.4
## 5  32.2    27.7  17.7                      28.7
## 6  35.7    30.6  18.8                      21.3
#Here we take a subset of this data frame to simplify the data of interest to only include case numbers, the given adiposity values, and our newly computed and rounded Siri values.
siri_comparison <- subset(newdata3, select = c(Case_number, body_fat_siri, computed_and_rounded_siri))
head(siri_comparison)
##   Case_number body_fat_siri computed_and_rounded_siri
## 1           1          12.3                      12.3
## 2           2           6.1                       6.1
## 3           3          25.3                      25.3
## 4           4          10.4                      10.4
## 5           5          28.7                      28.7
## 6           6          20.9                      21.3
#Instead of counting all of the cases where the given Siri numbers do not match our computed and rounded adiposity numbers, we will subset this data frame to only include entries where there is indeed a difference between the given Siri values and the computed and rounded Siri values. 
siri_rounding_errors <- subset(siri_comparison, body_fat_siri != computed_and_rounded_siri)
head(siri_rounding_errors)
##     Case_number body_fat_siri computed_and_rounded_siri
## 6             6          20.9                      21.3
## 48           48           5.6                      14.1
## 71           71          24.3                      24.2
## 76           76          18.5                      14.1
## 96           96          17.4                       0.4
## 100         100          22.2                      22.1
#Lastly, we will count how many cases exist that have improperly rounded Siri numbers by asking for how many rows exist in this final data frame. 
num3 <- nrow(siri_rounding_errors)
num3
## [1] 36

There are 36 cases where the researchers improperly rounded when computing the Siri numbers in their data set.

Plot of Siri Given Values vs Siri Computed Values

library(ggplot2)
p <- ggplot(data = newdata3, aes(body_fat_siri, computed_and_rounded_siri, color = Case_number)) +
  geom_point()+
  labs(x="Siri", y="computed Siri")+
  theme_bw()
library(plotly)
ggplotly(p)

Fat Free Weight Comparison

#Here we compute and round (to the nearest tenth) the Fat Free Weight values for each case.
computed_and_rounded_ffw <- round((1-(data$body_fat_Brozek/100))*data$weight, digits=1)

#Here we create a new data set using the mutate function from the dplyr package to include a new column with our computed Fat Free Weight values. 
newdata4 <- data %>%
  mutate(computed_and_rounded_ffw = computed_and_rounded_ffw)
  head(newdata4)
##   Case_number body_fat_Brozek body_fat_siri density age weight height
## 1           1            12.6          12.3  1.0708  23 154.25  67.75
## 2           2             6.9           6.1  1.0853  22 173.25  72.25
## 3           3            24.6          25.3  1.0414  22 154.00  66.25
## 4           4            10.9          10.4  1.0751  26 184.75  72.25
## 5           5            27.8          28.7  1.0340  24 184.25  71.25
## 6           6            20.6          20.9  1.0502  24 210.25  74.75
##   adiposity_index fat_free_wgt neck chest   abs   hip thigh knee ankle
## 1            23.7        134.9 36.2  93.1  85.2  94.5  59.0 37.3  21.9
## 2            23.4        161.3 38.5  93.6  83.0  98.7  58.7 37.3  23.4
## 3            24.7        116.0 34.0  95.8  87.9  99.2  59.6 38.9  24.0
## 4            24.9        164.7 37.4 101.8  86.4 101.2  60.1 37.3  22.8
## 5            25.6        133.1 34.4  97.3 100.0 101.9  63.2 42.2  24.0
## 6            26.5        167.0 39.0 104.5  94.4 107.8  66.0 42.0  25.6
##   bicep forearm wrist computed_and_rounded_ffw
## 1  32.0    27.4  17.1                    134.8
## 2  30.5    28.9  18.2                    161.3
## 3  28.8    25.2  16.6                    116.1
## 4  32.4    29.4  18.2                    164.6
## 5  32.2    27.7  17.7                    133.0
## 6  35.7    30.6  18.8                    166.9
#Here we take a subset of this data frame to simplify the data of interest to only include case numbers, the given Fat Free Weight values, and our newly computed and rounded Fat Free Weight values.
ffw_comparison <- subset(newdata4, select = c(Case_number, fat_free_wgt, computed_and_rounded_ffw))
head(ffw_comparison)
##   Case_number fat_free_wgt computed_and_rounded_ffw
## 1           1        134.9                    134.8
## 2           2        161.3                    161.3
## 3           3        116.0                    116.1
## 4           4        164.7                    164.6
## 5           5        133.1                    133.0
## 6           6        167.0                    166.9
#Instead of counting all of the cases where the given Fat Free Weight numbers do not match our computed and rounded Fat Free Weight numbers, we will subset this data frame to only include entries where there is indeed a difference between the given Fat Free Weight values and the computed and rounded Fat Free Weight values. 
ffw_rounding_errors <- subset(ffw_comparison, fat_free_wgt != computed_and_rounded_ffw)

head(ffw_rounding_errors)
##   Case_number fat_free_wgt computed_and_rounded_ffw
## 1           1        134.9                    134.8
## 3           3        116.0                    116.1
## 4           4        164.7                    164.6
## 5           5        133.1                    133.0
## 6           6        167.0                    166.9
## 8           8        153.6                    153.5
#Lastly, we will count how many cases exist that have improperly rounded Fat Free Weight numbers by asking for how many rows exist in this final data frame. 
num4 <- nrow(ffw_rounding_errors)
num4
## [1] 124

There are 124 cases where the researchers improperly rounded when computing the Fat Free Weight numbers in their data set.

Plot of Fat Free Weight Values vs Fat Free Weight Computed Values

library(ggplot2)
p <- ggplot(data = newdata4, aes(fat_free_wgt, computed_and_rounded_ffw, color = Case_number)) +
  geom_point()+
  labs(x="ffw", y="computed ffw")+
  theme_bw()
library(plotly)
ggplotly(p)

Why are there so many errors?

We really think that there are so many errors in this data set because there was some poor graduate student who had to first enter all of the data, which is not an easy task to do with 100% accuracy. Secondly, this poor graduate student would of had to do a number of calculations, unit conversions, and computations to get the values for the various variables that were based on measured quantities. That is a lot of number crunching, and it would be very easy to mistype a digit or two with such a lengthy mundane task. It certainly wouldn’t help their situation if they were doing this work on a laptop without a full keyboard or number pad like most desktop computers have. Lastly, there could be some rounding errors that come from premature truncation of computed values which would propogate the rounding error throughout the various computations. All of these thoughts are just hypotheses that stem from our perceptions of how the data set came to be.

Model to predict Brozek

We wanted to get rid of a few columns before running the following tests. By training the data, we are asking it what the best predictor of Brozek Body Fat Percentage is from the rest of the data. There are a few columns that should be taken out to eliminate confounding of the analysis. We decided to take out the columns: case number, body fat siri, density, adiposity, and fat free weight.

noco <- c(1, 3, 4, 8, 9)
newcleandata <- cleandata[,-noco]
head(newcleandata)
##   body_fat_Brozek age weight height neck chest   abs   hip thigh knee
## 1            12.6  23 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3
## 2             6.9  22 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3
## 3            24.6  22 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9
## 4            10.9  26 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3
## 5            27.8  24 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2
## 7            19.0  26 181.00  69.75 36.4 105.1  90.7 100.3  58.4 38.3
##   ankle bicep forearm wrist
## 1  21.9  32.0    27.4  17.1
## 2  23.4  30.5    28.9  18.2
## 3  24.0  28.8    25.2  16.6
## 4  22.8  32.4    29.4  18.2
## 5  24.0  32.2    27.7  17.7
## 7  22.9  31.9    27.8  17.7
library(caret)
set.seed(580)
trainIndex <- createDataPartition(y = newcleandata$body_fat_Brozek,
                              p = 0.75,
                              list = FALSE,
                              times = 1)
trainBrozek <- newcleandata[trainIndex, ]
testBrozek <- newcleandata[-trainIndex, ]
dim(trainBrozek)
## [1] 178  14
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5
                           )
set.seed(5)
stepMod <- train(body_fat_Brozek ~ ., data = trainBrozek, 
                 method = "leapSeq", 
                 trControl = fitControl,
                 verbose = FALSE)
stepMod
## Linear Regression with Stepwise Selection 
## 
## 178 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 142, 144, 142, 142, 142 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   2      4.096964  0.7350115  3.372226
##   3      4.181053  0.7254452  3.422281
##   4      4.114789  0.7321906  3.395224
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 2.
summary(stepMod$finalModel)
## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## age         FALSE      FALSE
## weight      FALSE      FALSE
## height      FALSE      FALSE
## neck        FALSE      FALSE
## chest       FALSE      FALSE
## abs         FALSE      FALSE
## hip         FALSE      FALSE
## thigh       FALSE      FALSE
## knee        FALSE      FALSE
## ankle       FALSE      FALSE
## bicep       FALSE      FALSE
## forearm     FALSE      FALSE
## wrist       FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: 'sequential replacement'
##          age weight height neck chest abs hip thigh knee ankle bicep
## 1  ( 1 ) " " " "    " "    " "  " "   "*" " " " "   " "  " "   " "  
## 2  ( 1 ) " " "*"    " "    " "  " "   "*" " " " "   " "  " "   " "  
##          forearm wrist
## 1  ( 1 ) " "     " "  
## 2  ( 1 ) " "     " "
yhat <- predict(stepMod, newdata = testBrozek)
RMSE <- sqrt(mean((testBrozek$body_fat_Brozek - yhat)^2))
RMSE
## [1] 4.185464
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5
                           )
set.seed(1)
fsMod <- train(body_fat_Brozek ~ ., data = trainBrozek, 
                 method = "leapForward", 
                 trControl = fitControl,
                 verbose = FALSE)
fsMod
## Linear Regression with Forward Selection 
## 
## 178 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 143, 142, 142, 142, 143 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   2      4.138804  0.7246304  3.384212
##   3      4.179063  0.7223180  3.466164
##   4      4.204451  0.7195213  3.421430
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 2.
summary(fsMod$finalModel)
## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## age         FALSE      FALSE
## weight      FALSE      FALSE
## height      FALSE      FALSE
## neck        FALSE      FALSE
## chest       FALSE      FALSE
## abs         FALSE      FALSE
## hip         FALSE      FALSE
## thigh       FALSE      FALSE
## knee        FALSE      FALSE
## ankle       FALSE      FALSE
## bicep       FALSE      FALSE
## forearm     FALSE      FALSE
## wrist       FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: forward
##          age weight height neck chest abs hip thigh knee ankle bicep
## 1  ( 1 ) " " " "    " "    " "  " "   "*" " " " "   " "  " "   " "  
## 2  ( 1 ) " " "*"    " "    " "  " "   "*" " " " "   " "  " "   " "  
##          forearm wrist
## 1  ( 1 ) " "     " "  
## 2  ( 1 ) " "     " "
yhat <- predict(fsMod, newdata = testBrozek)
RMSE <- sqrt(mean((testBrozek$body_fat_Brozek - yhat)^2))
RMSE
## [1] 4.185464
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5
                           )
set.seed(7)
beMod <- train(body_fat_Brozek ~ ., data = trainBrozek, 
                 method = "leapBackward", 
                 trControl = fitControl,
                 verbose = FALSE)
beMod
## Linear Regression with Backwards Selection 
## 
## 178 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 143, 142, 142, 143, 142 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   2      4.157698  0.7125853  3.393254
##   3      4.155764  0.7099930  3.446202
##   4      4.150568  0.7128766  3.414591
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 4.
summary(beMod$finalModel)
## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## age         FALSE      FALSE
## weight      FALSE      FALSE
## height      FALSE      FALSE
## neck        FALSE      FALSE
## chest       FALSE      FALSE
## abs         FALSE      FALSE
## hip         FALSE      FALSE
## thigh       FALSE      FALSE
## knee        FALSE      FALSE
## ankle       FALSE      FALSE
## bicep       FALSE      FALSE
## forearm     FALSE      FALSE
## wrist       FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          age weight height neck chest abs hip thigh knee ankle bicep
## 1  ( 1 ) " " " "    " "    " "  " "   "*" " " " "   " "  " "   " "  
## 2  ( 1 ) " " "*"    " "    " "  " "   "*" " " " "   " "  " "   " "  
## 3  ( 1 ) " " "*"    " "    " "  " "   "*" " " " "   " "  " "   " "  
## 4  ( 1 ) " " "*"    " "    " "  " "   "*" " " " "   " "  " "   " "  
##          forearm wrist
## 1  ( 1 ) " "     " "  
## 2  ( 1 ) " "     " "  
## 3  ( 1 ) "*"     " "  
## 4  ( 1 ) "*"     "*"
yhat <- predict(beMod, newdata = testBrozek)
RMSE <- sqrt(mean((testBrozek$body_fat_Brozek - yhat)^2))
RMSE
## [1] 4.156878

These findings suggest that the most important predictor of body fat given the Brozek computation is ab measurements. The second most important factor is weight. These two factors intuitively make sense given what we are calculating.

It should be noted that a different outcome could be expected if we had set the seed differently and if we had chosen to “clean” the data in a different way.